WinoGrande

Abstract

Commonsense reasoning remains a major challenge in AI, and yet recent progress on benchmarks may seem to suggest otherwise. In particular, the recent neural language models have reported above 90% accuracy on the Winograd Schema Challenge (WSC), a commonsense benchmark originally designed to be unsolvable for statistical models that rely simply on word associations. This raises an important question: have these models truly acquired robust commonsense capabilities, or do they rely on spurious biases in the dataset that lead to an overestimation of the true capabilities of machine commonsense? To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) crowdsourcing, followed by (2) systematic bias reduction using a novel AFLITE algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. Our experiments demonstrate that state-of-the-art models achieve considerably lower accuracy (59.4%-79.1%) on WinoGrande compared to humans (94%), confirming that the high performance on the original WSC was inflated by spurious biases in the dataset. Furthermore, we report new state-of-the-art results on five related benchmarks with emphasis on their dual implications. On the one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, the high performance on all these benchmarks suggests the extent to which spurious biases are prevalent in all such datasets, which motivates further research on algorithmic bias reduction.

Journal

Journal title: Communications of the ACM

Year: 2021

ISSN: 1557-7317, 0001-0782

DOI: https://doi.org/10.1145/3474381